Add design doc for lookup remote table in Fluid #9068
Conversation
@@ -0,0 +1,44 @@
# Design Doc: Large Model
Need a more meaningful name, like "remote large parameter prefetching"?
Done. Maybe "Prefetching Parameter From Parameter Server" sounds good?
@@ -0,0 +1,44 @@
# Design Doc: Large Model

## Abstract
This needs to explain the background: why do we need this feature?
Done.
### Split Large Parameter

<img src="src/split_parameter.png" width="400" />
It seems the numbers in the picture are wrong.
Hi @Yancey1989, the document needed some grammar polishing. I have pushed a commit directly to this PR to refine and polish the document.
## Abstract

We propose an approach to prefetch parameter from Parameter
pre-fetch
It should be:
"We propose an approach to pre-fetch the parameters from a Parameter Server during distributed training, so that Fluid is able to train a model with a large number of parameters that cannot be stored in one trainer's memory."
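To make the idea concrete, here is a minimal sketch in plain Python (not Fluid API; `FakeParameterServer` and `fetch_rows` are invented names): the trainer pre-fetches only the embedding rows its current batch needs, so the full table never has to fit in one trainer's memory.

```python
# Conceptual sketch only, not Fluid API. The trainer pre-fetches just the
# rows its current batch needs, so the full [vocab_size, emb_dim] table
# never has to live in one trainer's memory.
import numpy as np

EMB_DIM = 8

class FakeParameterServer(object):
    """Stands in for a remote pserver that owns (part of) a huge table."""
    def fetch_rows(self, ids):
        # Only the requested rows cross the "network". Rows are generated
        # deterministically here so no full table is ever materialized.
        return {i: np.random.RandomState(i).rand(EMB_DIM) for i in ids}

pserver = FakeParameterServer()
batch_ids = [3, 42, 3, 999999]                   # ids seen in this mini-batch
prefetched = pserver.fetch_rows(set(batch_ids))  # issued before the lookup runs
embeddings = np.stack([prefetched[i] for i in batch_ids])
print(embeddings.shape)                          # (4, 8)
```

In the actual design, the fetch would presumably be an operator in the ProgramDesc rather than a plain function call.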
Thanks for the design doc! Curious how much time prefetch can save in our use case. From my understanding, this is how prefetch saves time:
prefetch could overlap the fetch with the time "some-OP-A" and "some-OP-B" take. However, is it true that in our case the "OP-that-uses-the-prefetched-value" is at the very beginning of the ProgramDesc, so even if we insert the "pre-fetch-OP" at the beginning of the ProgramDesc, it would not save much time? E.g.,
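For illustration only (this is not the example from the comment above; the op names, costs, and `FETCH_COST` are made up), here is a tiny timing simulation of the argument: prefetch can only save time when there are earlier ops for the RPC to overlap with.

```python
# Toy timing model, for illustration only. Op names, costs, and FETCH_COST
# are invented; this is not Fluid code.
FETCH_COST = 5.0  # assumed cost of one RPC to the pserver

def run_without_prefetch(ops):
    """Every fetch blocks the op that needs the remote value."""
    total = 0.0
    for name, cost, needs_fetch in ops:
        if needs_fetch:
            total += FETCH_COST  # blocking fetch right before the op
        total += cost
    return total

def run_with_prefetch(ops, prefetch_before):
    """The fetch is issued asynchronously right before op `prefetch_before`."""
    elapsed, fetch_done_at = 0.0, None
    for name, cost, needs_fetch in ops:
        if name == prefetch_before:
            fetch_done_at = elapsed + FETCH_COST      # async RPC starts here
        if needs_fetch:
            if fetch_done_at is None:
                fetch_done_at = elapsed + FETCH_COST  # fall back to blocking
            elapsed = max(elapsed, fetch_done_at)     # wait only if not ready
        elapsed += cost
    return elapsed

# Case 1: the consuming op is the very first op, so there is nothing for the
# RPC to overlap with and prefetch saves no time.
head = [("lookup_table", 1.0, True), ("some-OP-A", 3.0, False), ("some-OP-B", 3.0, False)]
print(run_without_prefetch(head), run_with_prefetch(head, "lookup_table"))  # 12.0 12.0

# Case 2: the consuming op comes later, so the RPC overlaps with A and B.
tail = [("some-OP-A", 3.0, False), ("some-OP-B", 3.0, False), ("lookup_table", 1.0, True)]
print(run_without_prefetch(tail), run_with_prefetch(tail, "some-OP-A"))     # 12.0 7.0
```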
@helinwang Well, I think prefetch is not intended to save time, but to make it possible to train with very large parameters distributed across many pservers (because the feature space is very large). This feature could definitely slow down training, but it is very useful for some CTR models.
Thanks @abhinavarora, that looks better 😆😆
@typhoonzero I see, that makes sense, thanks for the reply!
Please review design #9075 first, as it is related to the Abacus migration. We should consider being compatible with both sides.
I changed the title to
For another design, we can implement a distributed sparse table in Fluid,
and don't need to maintain an external storage component while training.

Prior to reading this design, it would be useful for the reader to make themselves
You may need to read ... before going on.
![fluid lookup remote table](./src/fluid_lookup_remote_table.png)

Partition a large table into multiple pserver instances
1. `DistributeTranspiler` would split the table partitioned into some small
I think we only use mod for now, but never mind, it's a design.
Mod is used to find the right pserver according to the input Id, and we also need to initialize the shape of the table block on each pserver by splitting the table.
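A minimal sketch of that explanation (illustrative only; these helper names are not the real `DistributeTranspiler` API): ids are routed to pservers by mod, and each pserver's table block shape comes from splitting the row count.

```python
# Illustrative sketch only; not the actual DistributeTranspiler code.
# Rows of a [vocab_size, emb_dim] table are assigned round-robin by mod.

def pserver_for_id(row_id, num_pservers):
    """Route an input id to the pserver that owns its row."""
    return row_id % num_pservers

def block_shapes(vocab_size, emb_dim, num_pservers):
    """Shape of the table block initialized on each pserver."""
    return [[(vocab_size - i + num_pservers - 1) // num_pservers, emb_dim]
            for i in range(num_pservers)]

# A 10-row table with emb_dim=8 on 3 pservers:
print(block_shapes(10, 8, 3))   # [[4, 8], [3, 8], [3, 8]]
print(pserver_for_id(7, 3))     # row 7 lives on pserver 1
```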
LGTM, we may need to remove old design sections later.
I will update the outdated part of this design.
…gn_doc Add design doc for lookup remote table in Fluid
Fixed #9066